Welcome back to our hands-on section on processing spectroscopic data in Python. At this point, we assume you have already gone through the previous article on data formats and importing. You should have imported the contest dataset and loaded the following variables:

  • trainData
  • trainClass
  • wavelengths
  • testData

Today, we will use all of these variables except testData. We start with simple plotting and visualization, followed by a demonstration of principal component analysis (PCA).

Let's start by importing some useful libraries. matplotlib and numpy are essential Python libraries that are used almost always. The seaborn library adjusts the appearance of figures created by matplotlib (similar to the beloved ggplot-style plots used in R), and pandas offers great tools for handling large datasets and performing operations on them.

In [0]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
sns.set()

The most basic visualization task is to plot a single spectrum. We create a pandas DataFrame object to combine the wavelength values with the corresponding intensities. Here, we plot the first spectrum of the dataset, which has index 0 (beware of this zero-based indexing if you are coming from MATLAB or R).

In [0]:
df=pd.DataFrame({'wave': wavelengths, 'intensity': trainData[0,:] })

fig = plt.figure()
f1 = fig.add_subplot(1, 1, 1)
f1.plot('wave', 'intensity', data=df)
f1.set_xlabel('Wavelength (nm)')
f1.set_ylabel('Intensity (-)')
fig.set_size_inches(12, 6)

While this simple matplotlib figure is often all you need, we will demonstrate more advanced, interactive plotting.

Let's continue exploring the benchmark dataset (check out more details about its complexity). It is often beneficial to average the spectra in a large dataset to obtain some general information. We will create a mean spectrum for each class; the numpy library offers the simplest solution for this. In total, there are 12 classes.

In [0]:
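trainData_mean = []
# Iterate over the class labels actually present in trainClass, so the loop
# works whether the labels start at 0 or at 1.
for label in np.unique(trainClass):
    # The boolean mask selects the spectra of one class; average over rows
    trainData_mean.append(np.mean(trainData[trainClass == label], axis=0))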
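
The same per-class averaging can also be written with pandas. The sketch below is only an illustration on a small synthetic matrix: demo_data and demo_class are hypothetical stand-ins for trainData and trainClass, and the labels 1 to 3 are an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for trainData / trainClass (not the contest data):
# 6 spectra with 4 wavelength channels each, 3 classes labelled 1..3.
demo_data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [10.0, 10.0, 10.0, 10.0],
    [20.0, 20.0, 20.0, 20.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 1.0, 2.0, 1.0],
])
demo_class = np.array([1, 1, 2, 2, 3, 3])

# Group the rows (spectra) by class label and average each group.
demo_mean = pd.DataFrame(demo_data).groupby(demo_class).mean()

# One row per class, one column per wavelength channel.
print(demo_mean.shape)
print(demo_mean.loc[1].to_numpy())
```

The result is a DataFrame indexed by class label, so a single mean spectrum can be looked up by its label instead of by its position in a list.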

We could again use matplotlib as above, but for interactive handling the plotly library is a better option. Here, we define a plotting function with two arguments. A loop adds the spectrum of each class to the figure.

In [0]:
import plotly.graph_objects as go

def plot_interactive(y_val, x_val):
    # y_val: 2-D array with one column per class; x_val: wavelength axis
    fig = go.Figure()
    for i in range(y_val.shape[1]):
        fig.add_trace(go.Scatter(x=x_val, y=y_val[:, i],
                                 mode='lines',
                                 name=str(i + 1) + '. class'))

    fig.update_layout(
        title = "Mean spectrum for each class",
        xaxis_title = "Wavelength (nm)",
        yaxis_title = "Relative intensity (-)"
    )
    fig.show()

Finally, we apply the function to our data. The matrix of mean spectra is transposed so that each column holds one class, as the plotting function expects. Explore the broad interactivity of plotly figures (zoom in, click on the legend to select specific spectra to plot, and much more)!

In [0]:
plot_interactive(np.transpose(trainData_mean), wavelengths)
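
To see what the transposition does to the shape, here is a minimal sketch with made-up numbers. demo_means is a hypothetical list of three short "spectra", not the contest data.

```python
import numpy as np

# Hypothetical list of 3 class-mean spectra, each with 5 wavelength channels.
demo_means = [np.full(5, fill_value=k, dtype=float) for k in range(3)]

stacked = np.array(demo_means)         # shape (3, 5): one row per class
transposed = np.transpose(demo_means)  # shape (5, 3): one column per class

print(stacked.shape, transposed.shape)
# Column i of the transposed array is the mean spectrum at list position i.
print(np.array_equal(transposed[:, 1], demo_means[1]))
```

After transposing, indexing with [:, i] returns a full spectrum, which is how the plotting function reads its first argument.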

As promised, we continue with a simple PCA implementation using the scikit-learn library.

In [88]:
from sklearn.decomposition import PCA

pca = PCA(n_components=20)
pca_scores = pca.fit_transform(trainData)

fig = plt.figure()
fig.set_size_inches(16, 9)
f2 = fig.add_subplot(1, 1, 1)
plt.scatter(pca_scores[:, 0], pca_scores[:, 1],
            c=trainClass, edgecolor='none', alpha=1,
            cmap='Paired')
plt.colorbar(ticks=np.linspace(1,12,12), label='Class')
plt.clim(0.5, 12.5)
f2.set_xlabel('Principal component 1')
f2.set_ylabel('Principal component 2')

Out[88]:
Text(0, 0.5, 'Principal component 2')
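
For readers curious about what the scores represent, the following numpy-only sketch computes them by hand with a singular value decomposition. It illustrates the idea on random numbers and is not a copy of scikit-learn's internals; demo_data is a hypothetical stand-in for trainData, and the signs of individual components may differ from those returned by a library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for trainData: 50 spectra with 8 channels each.
demo_data = rng.normal(size=(50, 8))

# PCA operates on mean-centred data (each channel has zero mean).
centered = demo_data - demo_data.mean(axis=0)

# Singular value decomposition: the rows of vt are the principal axes.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

n_components = 2
# Scores are the projections of the spectra onto the first principal axes.
scores = centered @ vt[:n_components].T

# Fraction of the total variance captured by each component.
explained_ratio = s**2 / np.sum(s**2)

print(scores.shape)
print(explained_ratio[:n_components])
```

With scikit-learn, the corresponding fractions for a fitted model are available as pca.explained_variance_ratio_, which helps in judging how much of the variance the first two components shown in the scatter plot account for.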